Automating Exploratory Data Analysis

Paul M Washburn

May 6 2019

auto_eda

Towards Automated Exploration

The primary goal of auto-explore is to establish a codebase that reduces the effort needed to produce a reasonable first-pass exploratory data analysis for a variety of dataset types.

This Python library is a first attempt at automating the process of exploratory data analysis – at least as far as computation and visualization are concerned.

Critical thinking is not included.

Potential Benefits of Semi-automated EDA

  • Faster time to insights & modeling
  • Shorter exploratory data analysis turnaround
  • Consistent, reliable processes that are vetted & improved over time
  • No need to re-configure old code to new situations
  • Supplies a base for more in-depth analysis

Previous Work in the Space

Overview

Simply specify a dataset and a few attributes.

Much of this library's functionality is devoted to generating visualizations. However, a great deal of analytical functionality is included as well.

Text Analysis

tsne

Time Series

plot_tseries_over_group_with_histograms

Correlation Heatmaps

correlation_heatmap

Categorical Analysis

target_distribution_over_binary_groups

Clustering Analysis

cluster_and_plot_pca1

cluster_and_plot_pca2

Functionality

The Central Object

Simply specify a DataFrame and a list for each of its binary, categorical, numerical and text columns. If applicable, set the target_col as a string.
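The constructor pattern described above can be sketched as follows. This is only an illustration under stated assumptions: the class name `ExplorerSketch`, the argument names other than `target_col`, the `column_roles` helper, and the toy data are all hypothetical and are not taken from the library itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

import pandas as pd


@dataclass
class ExplorerSketch:
    """Hypothetical stand-in for the central object (not the library's real class)."""

    df: pd.DataFrame
    binary_cols: List[str] = field(default_factory=list)
    categorical_cols: List[str] = field(default_factory=list)
    numerical_cols: List[str] = field(default_factory=list)
    text_cols: List[str] = field(default_factory=list)
    target_col: Optional[str] = None  # a single column name, if applicable

    def __post_init__(self) -> None:
        # Fail early if any declared column is absent from the DataFrame.
        declared = (
            self.binary_cols
            + self.categorical_cols
            + self.numerical_cols
            + self.text_cols
        )
        if self.target_col is not None:
            declared = declared + [self.target_col]
        missing = [c for c in declared if c not in self.df.columns]
        if missing:
            raise KeyError(f"Columns not found in DataFrame: {missing}")

    def column_roles(self) -> Dict[str, object]:
        # Return the declared role of each group of columns.
        return {
            "binary": self.binary_cols,
            "categorical": self.categorical_cols,
            "numerical": self.numerical_cols,
            "text": self.text_cols,
            "target": self.target_col,
        }


# Toy data invented purely for illustration.
df = pd.DataFrame(
    {
        "is_member": [1, 0, 1],
        "region": ["north", "south", "north"],
        "spend": [12.5, 40.0, 7.25],
        "review": ["great", "too slow", "fine"],
        "churned": [0, 1, 0],
    }
)

explorer = ExplorerSketch(
    df=df,
    binary_cols=["is_member"],
    categorical_cols=["region"],
    numerical_cols=["spend"],
    text_cols=["review"],
    target_col="churned",
)
print(explorer.column_roles())
```

Declaring column roles up front lets each downstream plot or analysis pick the columns it applies to without re-inspecting dtypes, which is one plausible reason for the design; the library's actual internals may differ.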